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ABSTRACT 

We introduce an ordinal classification algorithm for photometric redshift estimation, 
which significantly improves the reconstruction of photometric redshift probability 
density functions (PDFs) for individual galaxies and galaxy samples. As a use case 
we apply our method to CFHTLS galaxies. The ordinal classification algorithm treats 
distinct redshift bins as ordered values, which improves the quality of photometric 
redshift PDFs, compared with non-ordinal classification architectures. We also propose 
a new single value point estimate of the galaxy redshift, that can be used to estimate 
the full redshift PDF of a galaxy sample. This method is competitive in terms of 
accuracy with contemporary algorithms, which stack the full redshift PDFs of all 
galaxies in the sample, but requires orders of magnitudes less storage space. 

The methods described in this paper greatly improve the log-likelihood of indi¬ 
vidual object redshift PDFs, when compared with a popular Neural Network code 
(ANNz). In our use case, this improvement reaches 50% for high redshift objects 
(z Si 0.75). 

We show that using these more accurate photometric redshift PDFs will lead to a 
reduction in the systematic biases by up to a factor of four, when compared with less 
accurate PDFs obtained from commonly used methods. The cosmological analyses we 
examine and find improvement upon are the following: gravitational lensing cluster 
mass estimates, modelling of angular correlation functions, and modelling of cosmic 
shear correlation functions. 

Key words: galaxies: distances and redshifts, catalogues, surveys. 


1 INTRODUCTION 

The determination of distance, or redshift, estimates to 
galaxies is a vital requirement before using large scale pho¬ 
tometric galaxy surveys for many cosmological analyses. 
Large scale surveys, such as the SDSS ( York et al.||2000a l, 
PanSTARRS ( Tonry et al.|2012 |, DES ( Flaugher|2005 1 and 
LSST ( |Tyson et al.|2 003 ) rely on a combination of photomet¬ 
ric and more accurate spectroscopic redshifts when provid¬ 
ing distance estimates to photometrically identified galaxies. 

Photometric redshifts are used throughout astrophysics 
and cosmology, for example in large scale structure analyses 
( |Staniszewski et al.||2009| |de Simoni et al.]|2013| ), in galaxy 


cluster weak lensing analyses (Gruen et al. 2013), and in 
galaxy-galaxy lensing analyses ( Brimioulle et al.|2013 1. Pho¬ 
tometric redshifts are obtained using either machine learn¬ 


ing methods or template fitting techniques (see e.g., Benitez 
2000 Csabai et al.||2000l |Bender et al.||2001| |Ilbert et al. 


2006 


Feldmann et al. 2006| Greisel et al. 2013|) . Machine 


learning techniques range from early works employing ar¬ 


tificial Neural Networks (Firth, Lahav & Somerville 2003 


Collister & Lahav 2004) as photometric point predictors, 


to recent developments that estimate the full photometric 
redshift PDF of the galaxy (Lima et al. 2008 Cunha et al. 


2009 Carrasco Kind & Brunner 2013 Bonnett 20151. For 
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2 Rau et al. 


detailed reviews and comparisons of different photometric 
redshift techniques we refer the reader to |Sanchez et al.| 
(20141; Hildebrandt et al. (20101; Dahlen et al. (20131. This 
work focuses on machine learning methods for photometric 
redshift PDF estimation for samples of galaxies (hereafter 
sample PDF) as well as individual galaxies (hereafter indi¬ 
vidual PDFs). We apply the results to a range of analyses 
in weak gravitational lensing, cosmic shear and large scale 
structure. 

In general, machine learning algorithms learn a map¬ 
ping between the photometry of an object and the spec¬ 
troscopic redshift. To train the machine learning models to 
learn this mapping, one typically identifies spectrophotomet- 
ric data that overlaps with the photometric feature space of 
the final data sample for which one would like to estimate 
redshifts. However, recent work shows that machine learning 
can also be performed with spectroscopic reference data that 
is brighter than the photometric sample (Hoyle et al. 2015a). 
Many photometric surveys include a dedicated spectroscopic 
follow up program, which allows such a machine learning 
system to be built, e.g., SDS S-I/II (|York et al.|2000b l, 2dF 
( Colless et al.|200T l, VVDS ( Le Fevre et al.|2005 (, WiggleZ 
( Drinkwater et al.||2010 l. 

The mapping obtained with machine learning is only 
approximate: the redshift of an object cannot be exactly de¬ 
termined by its corresponding photometry. Moreover, most 
machine learning methods produce a point estimate, which 
reduces the individual PDF to one number. The point es¬ 
timate only predicts the most likely value of the redshift, 
irrespective of the quality of the photometry, and the shape 
of the distribution. In order to enter the era of precision 
cosmology, one must be able to incorporate the uncertainty 
in the redshift estimate into the cosmological analysis. This 
means that the use of single point redshift predictions is no 
longer sufficient. To achieve precision cosmology, we are re¬ 
quired to incorporate the full redshift uncertainty using the 
individual PDFs. 

We can obtain a sample PDF by stacking the individual 
PDFs. This distribution describes the probability that a ran¬ 
domly sampled galaxy has a certain redshift. The accurate 
estimation of the redshift distribution of the full sample is 
important for many cosmological analyses, e.g, in large scale 
structure, weak gravitational lensing, and cosmic shear. 

However, effectively estimating and storing the photo¬ 
metric redshift PDF instead of the point estimate, for each 
object in a large astronomical dataset, is a challenging task. 
This process requires efficient and accurate photometric es¬ 
timation algorithms, and scalable data storage solutions. 
These algorithms must be benchmarked using carefully con¬ 
structed performance metrics to be useful for the next gen¬ 


eration large scale structure photometric surveys (e.g., Lau- 
reijs et aL||2011 1. 


We discuss such metrics to quantify performance of pho¬ 
tometric redshift PDF estimation in We describe the Or¬ 
dinal Class PDF (OCP) algorithm in j |3.2[ which improves 
the estimation accuracy over commonly used non-ordinal 
classification architectures. We continue in §3.4| by showing 
how the OCP method can become more storage efficient, by 
combining it with the Gaussian mixture model. This enables 
the storage of the PDFs of individual galaxies even within 
massive datasets without significant demands on disc space. 

Many applications in cosmology require an estimation 


of the sample PDF. We propose a single point estimator for 
this quantity in §3.5| and show how this single floating point 
number can be computed very efficiently and achieves good 
performance when compared with algorithms that stack in¬ 
dividual PDFs. The performance of the proposed techniques 
is demonstrated and analysed in a method comparison in 
(5.1 and (5.2 using a spectrophotometric dataset ((j4| ob¬ 
tained from the public CFHTLS WIDE survey. 

Finally, we demonstrate in §5.3| that the methods in¬ 
troduced in this work improve the precision of gravitational 
lensing cluster mass estimates, measurements of angular cor¬ 
relation functions, and analyses of cosmic shear correlation 
functions, when compared with results obtained using a 
common Neural Network code. We conclude and summa- 


2 FUNDAMENTAL CONCEPTS 

The following section gives a brief review of important statis¬ 
tical concepts needed in this work. We start with a short in¬ 
troduction to density estimation, introduce metrics to quan¬ 
tify the performance of density estimators and finally de¬ 
scribe a scheme to assess the performance of a machine 
learning model. 


2.1 Kernel Density Estimation 


The goal of kernel density estimation is to find a good es- 
timatoiQ p(x) for the probability density function p(x) of 
a random variable X using N samples X;. Consider a small 
region 1Z centred on a point x. We can then assume that 
p(x) is approximately constant across 1Z. Based on this as¬ 
sumption we can estimate the density at point x as 

w-ik- (1) 

The number of object^] k in Eq. [l] can be estimated by 
considering a D dimensional hyper cube with volume 

Vn = h D (2) 

centred on the point x with side length h. Using Eq. [I] we 
obtain k as 


( 3 > 


where 

A-(d),/ 1 ' I*' 

I 0, otherwise 

is an example of a kernel function, 
has discontinuities at the boundaries. The bandwidth h de¬ 
termines how much the kernel density estimate interpolates 
(or smoothes) between the given data points. A bandwidth 
that is too large oversmooths important structures in the 


1 D 


( 4 ) 


Note that, this kernel 


1 In the following we will mark the estimator for a quantity with 
a hat. 

2 Fixing the number of points k that fall into 1Z and estimating 
the volume Vjz leads to the k nearest neighbour density estima¬ 
tion technique (see e.g. |Scott|1992j ). 
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Accurate photometric redshift PDF estimation 3 


density whereas one that it too small leads to a noisy den¬ 
sity estimate. The density estimate p(x) can then be written 
as 



i=l 


1 N 

( 5 ) 


Instead of using Eq. [4] which has discontinuities at the 
boundaries, we can alternatively use smooth and symmetric 
functions, for example, a Gaussian. 

The estimation of photometric redshift PDFs for indi¬ 
vidual objects (individual PDFs) is an application of con¬ 
ditional probability density function estimation, since the 
individual PDF p(z|f) is conditional on the objects pho¬ 
tometry f. The estimation of conditional probability density 
functions can be formulated in close analogy to Eq. [5] We 
can estimate the individual PDF p(z|f) as a weighted kernel 
density estimate in redshift space of the form 


JVtr 


p(-i f )= y. wi{t)K(z, Zj pec , k ), 

i= 1 


( 6 ) 


using a dataset, the so called training set, containing JVt r 
objects. K(z, z° pec , h) denotes a kernel function with band¬ 
width h centred on the spectroscopic redshift values 2 “ pec . 
The weights Wi(f) sum to unity and depend on the photom¬ 
etry f of the object. 

The conditional cumulative distribution function .F(z|f) 
defined as 


F(z\f) = f p{z\f)dz 

■J — OO 

can be estimated ( Meinshausen|[2006 1 as 


JVtr 


E(.s|f) = Wj{f)I(zl pec ^ z). 


(7) 


( 8 ) 


7(s| pec ^ z) equates to unity if z^ pec ^ z and to zero other¬ 
wise. 

The redshift distribution p(z ) of a sample (sample PDF) 
containing N objects can be estimated by stacking the indi¬ 
vidual PDFs 


N 

p(z ) = Wsta.ck,iP(;?|fi) . (9) 

i= 1 


The normed weights repack,i can be set to 1/N or chosen 
to give more weight to certain sub populations. For example, 
we can favour certain redshift intervals « £ [a, 6 ] by defining 
weights as 

w a t,ck= [ p(z\t)dz = F(b\f) - F(a\t), (10) 

J a 

and we show an example of such a weighting in §5.2| The 
above weights are normalized afterwards to sum to unity. 


2.2 The Gaussian Mixture Model 


In this paper, we consider kernel density estimators and 
Gaussian mixture models for density estimation. A Gaus¬ 
sian mixture model (see, for example, Bishop] [2006 1 for the 
probability density function p(x) of a random variable X is 
a linear combination of K normal densities defined as 


p(x) = y^ aiJ\f (x, cjj) 


( 11 ) 


where cti is the amplitude, pi is the mean, and cn is the 
standard deviation of the mixture component i. 

We define the weight proportion 7 px) of component k 
as 


7 k(x) 


akJV {x,pk,(x k ) 

YJ' ; o ;A'(.(••//; j ’ 


( 12 ) 


where 7 *,( 2 ;) determines how much a certain component of 
the Gaussian mixture model contributes to the total density 
at point x. 


2.3 Evaluation Metrics 


Consider an estimate p(x) of the true probability density 
function p(x) describing the distribution of the random vari¬ 
able X. We can measure the quality of the estimate p(x) 
by its distance D(p(x)||p(x)) to the true distribution p(x), 
which is generally unknown. The Kullback-Leibler diver¬ 
gence between the true density p(x) and the estimate p(x) 
is defined using the natural logarithm as, 

D(p\\p) = f_j(x)lo g 

A good estimate p for p should minimize D(p\\p). Rewriting 
the logarithm we obtain 


dx. 


(13) 


D(p\ \p) = / p(x) log (p(x)) dx - / p(x) log (p(x)) dx , 

J —OO J — OO 

(14) 

and we note that the first term is a constant that does not 
depend on the model parameters, for example bandwidth, 
kernel or shape of kernel function. Thus, the second term in 
Eq. [14] can be used as a relative measure of the accuracy of 
p(x). If we use the sample mean to estimate the expectation 
with respect to p(x), we obtain the mean negative log likeli¬ 
hood loss, hereafter MNLL, (Habbema, Hermans & Van den 
Broek|pti| |Duin|1976 1 


MNLL = ^ ^ log (j5(x,) + e), 


(15) 


where we set e = 1 CP 6 to avoid floating point underflow. 
The Kullback-Leibler Divergence is a distance and thus non 
negative and it is smallest if the MNLL is smallest. 

A suitable loss function for individual PDFs can be de¬ 


fined analogously (see e.g., Takeuchi, Nomura & Kanamori 


2009 Frank & Bouckacrt 2009 Sugiyama et al. 20101. We 
estimate p(z|f,) for each of the N objects in the sample, in 
order to establish performance using a sample of objects for 
which spectroscopic redshift values have been observed. We 
then evaluate p(z\fi) at the object’s observed spectroscopic 
redshift p{z = 2 sp ec,i|f;)- In the rest of the paper, the ab¬ 
breviation MNLL refers to the mean negative log-likelihood 
loss evaluated for individual PDFs. 


2.4 Model Training 

We randomly sample three non-overlapping datasets with¬ 
out replacement from the available data: the training set, 
the validation set and the test set. The model is trained on 
the training set and the model parameters are chosen by 
testing the performance of the trained model with different 
parameter settings on the validation set. 
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4 Rau et al. 


The validation set is used during model tuning and 
therefore it does not provide a good estimate of the per¬ 
formance on unseen data. We measure this generalization 
performance on a test set that is not used during training 
and tuning. 

To evaluate the machine learning algorithms, we con¬ 
struct a training set containing 9000 objects, a validation 
set containing 3000 objects and a test set containing 22072 
objects. After the validation set has been used to determine 
the best combination of model parameters, we merge the 
training set and the validation set and train the respective 
model again with this best setup. In this way, we make op¬ 
timal use of the available data to build the final model. All 
results described in !j5]are obtained on the test set, which, 
we reiterate, was not used in all prior steps of model training 
and tuning. 

In this work we choose to use the aperture magnitudes 
of the CFHTLS WIDE five band photometry as input at¬ 
tributes. Other photometric features may be used, for exam¬ 


ple see Hoyle et al. (2015b) for a feature importance analy- 


3 ALGORITHMS 

We have introduced the estimator for the photometric red- 
shift PDF of individual objects (individual PDF) in Eq. [6] 
as a weighted kernel density estimate that depends on the 
weights w(f). The following section discusses two algorithms 
that can be used to estimate these weights. 


3.1 Quantile Regression Forest (QRF) 


The Quantile Regression Foresf]^] (|Meinshausen 


2006) is a 

generalization of the Random Forest (|Breiman] 200l| that 


can be used to reconstruct individual PDFs, an algorithm 
known as TPZreg (Carrasco Kind & Brunner 2013) in as¬ 
trophysics. 

A regression/classification tree partitions the input 
space and returns the mean/majority vote of the response 
values (i.e., the redshift values) of the training set objects in 
each partition as the final prediction for new objects falling 
into that partition. The tree partitions the input data such 
that the training set objects in each partition are most sim¬ 
ilar with respect to their response values. In regression, we 
measure similarity using the sum of squares loss function 
SSE, defined as 


i 

SSE = Y, J2 (ZBpec.i - <2spec,T» 2 . (16) 

t= 1 fi€K T 

The sum runs over all l leaf nodes of the tree 1 ^ r ^ I, which 
each represents a certain partition 1Z T in input space, and 
over all objects in the training set (fi, 2 sp ec,i) with attribute 
values f; that fall into 1Z T . The term (zspec.-r) denotes the 
mean spectroscopic redshift of all training set objects that 
fall into 1Z T . 


3 The method was originally developed to estimate conditional 
quantiles hence the name Quantile Regression Forest. 


The binary tree is recursively grown by choosing a split¬ 
ting attribute and split point for each region using brute- 
force search such that the SSE is minimized. 

The Random Forest algorithm combines several trees 
by bootstrap aggregation which is described as follows. New 
training sets are drawn from the original training set with 
replacement, which is also known as bootstrapping. We train 
a tree model on each of these bootstrapped training sets, to 
obtain an ensemble of trees. Combining the estimates from 
all trees in the ensemble reduces variance. In addition, the 
Random Forest algorithm makes the resulting models even 
more diverse by modifying the way each tree is grown. Before 
each split selection, the routine randomly selects a certain 
number of attributes, as specified by the ‘mtry’ parameter, 
on which the algorithm can perform the split. 

The complexity of the tree model is governed by the size 
of the leaves of the tree. We stop the recursive tree build¬ 
ing process when a specified minimum number of objects in 
each leaf, denoted as ‘nodesize’ is reached. If the nodesize 
is small, very complex trees are grown and the tree might 
overadapt to the training set. This is an example of overfit¬ 
ting. The prediction from the Random Forest is the mean, 
in regression, or the majority vote, in classification, of the 
predictions from the ensemble of trees. 

A single tree in the Random Forest splits the space 
spanned by the input attributes derived from the photom¬ 
etry of the objects into partitions which are represented by 
the tree leaves. Each leaf defined in this manner is associated 
with the mean spectroscopic redshift value of the training 
set objects in this leaf. The tree therefore approximates the 
underlying smooth function by a step function. If a new 
object is queried, it will be placed in a leaf containing ob¬ 
jects with similar photometry. Following the formulation by 


prediction 


Meinshausen (20061, we can write the photometric redshift 


JV tr 


2phot(f) = Wi(f)z s] 


(17) 


as a weighted sum over the spectroscopic redshift values 
Zspec.i of the N tr training set objects. In order to distin¬ 
guish the different trees in the ensemble, which are charac¬ 
terized by different split selections, we introduce a parameter 
9, which characterizes each tree. All training set objects with 
photometry f* r that are located in the same region IZkjj) 
(defined by the leaf l( f, 9)) as the newly queried object with 
photometry f, get a constant weight, and all other training 
set objects get zero weight. This can be written as 


Wi{ f, 6 ) 


I (f* r G 77-!(f,e)) 

Efrr^ff £7V, e) ) ’ 


(18) 


where the weights are normalized such that they sum to 
unity. 

The same concept holds for the Random Forest predic¬ 
tion, in which the weights associated with each training set 
object are averaged over k trees, each grown on different 
bootstrapped datasets, and therefore each described by a 
different parameter 9 b : 


w i (f)=lY /Wi (f,9 b ). (19) 

^ 6=1 

The weights can be used to estimate the individual PDF and 
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QRF/HWE OCP NOCP OCP GMM 


nodesize 
mtry 
BW mod 


3,5,7,10 1,2,3,5,7,9 1,2,3,5 1,2,3,5,7,9 

1 , 2 , 3 , 4,5 1 , 2 , 3 , 4,5 1 , 2 , 3 , 4,5 1 , 2 , 3 , 4,5 

0.5,0.6,...,1.8,...,3.0 0.5,0.6,...,2.5,...,3.0 0.5,0.6,...,2.0,...,3.0 - 


Gauss Comp. 


1,2,3 


Table 1. Model parameters of the Quantile Regression Forest (QRF), the classification based PDF estimation algorithms (OCP/NOCP) 
and the OCP algorithm used with the Gaussian mixture model OCP GMM. ‘nodesize’ and ‘mtry’ are model parameters of the Random 
Forest described in ^3] ‘BW mod’ is the bandwidth modification factor employed in the Scott’s rule (Eq. |24[ ) and Gauss Comp, denotes 
the maximum number of components allowed in the Gaussian mixture model. The best parameter configuration for each algorithm picked 
on the validation set during model tuning ( ^2.4| is marked in bold type. 


z i <z<z 2 

z 2 < z<z, 

z 3 <z < z 4 

z 4 <z<z 5 

z,<z<z. 

z,<z<z 5 


Zi<Z<Z 3 

z 3 <z<z 5 


Z!<Z<Z 4 

Figure 1. An illustrative example of a nomir 
problem with four redshift bins. These bins can 

Z 4^ Z<Z 5 

al classification 
3e reformulated 


into three binary classification problems by merging neighbouring 
bins. The class probabilities from the binary classification prob¬ 
lems can be recombined to incorporate the ordering between the 
redshift bins (see text) into the final classification. 



Figure 2. Ordinal classification can result in non monotonic cu¬ 
mulative distribution functions. We calibrate them using isotonic 
regression. Isotonic Regression (black) approximates the original 
estimate (red) as a monotonically increasing step function. 


corresponding statistics like the conditional mean, the con¬ 
ditional cumulative distribution function or the conditional 
standard deviation defined as 

N t r 

(T 1 2 3 (a|f) = i) (z sp ec,i - 2phot(f*)) 2 ■ (20) 

i=l 

The following section introduces an alternative way of esti¬ 
mating the weights in Eq. [6] using a classification scheme. 


3.2 Ordinal Class PDF (OCP) estimation 


The basic idea of classification-based PDF estimation is to 
bin the spectroscopic data by redshift and use a classification 
algorithm that outputs probabilities for bin membership to 
reconstruct the PDF. Bin membership is viewed as an ordi¬ 
nal variable. Ordinal scale variables, in contrast to nominal 
ones, exhibit an intrinsic order. If the classes in a classifi¬ 
cation problem are ordinal, we can use this information to 
improve the classification (Frank & Hall|2001). 

Current classification-based PDF estimation methods 
in the astrophysics literature (e.g., Bonnett||2015 Carrasco 
|Kind fc Brunner 2013| treat redshift bins as nominal classes. 
In the following, we will refer to the latter as the non-ordinal 
class PDF (NOCP) algorithm. The ordinal class PDF (OCP) 
algorithm trains a separate classiher that estimates the prob¬ 
ability p{z ^ z{) that a new object has redshift 2 above a 
certain threshold Zi given by the edge of the respective red¬ 
shift bin. This scheme is illustrated in Fig. [T] The probability 


that the redshift of an object resides in the original bins is 
then calculated from these separate classification models as 
( Frank fe Hall|2001| > 

1 p(z € [zi,z 2 \} = 1 -p{z > Z 2 ) 

2 p(z £ [zi-i,Zi[) =p{z > Zi- 1 ) - p{z ^ Zi) 

3 p(z e \zk-i,Zk\) =p(z > Zk-i) , 1 < i < k . 


The reconstruction of the class probabilities p(zi) has the 
idealistic assumption that each of the classifiers used to es¬ 
timate the probability p(z ^ Zi) outputs perfect probabil¬ 
ities. In practice, this will not be the case and the recov¬ 
ered cumulative distribution function, which is a monotoni¬ 
cally increasing function, has to be calibrated. Scha pire et al.| 
(20021; Frank & Bouckaert (20091 use a heuristic approach 
to ensure this monotonicity requirement. Alternatively, we 
use the ‘isotonic’ regression technique to calibrate the class 
probabilities. Isotonic regression is synonymous for mono¬ 
tonically increasing regression and is a technique for which 
efficient implementations are available ( |de Leeuw, Hornik &;| 
Mair|2009 ). For increasing bin index, isotonic regression op¬ 


timizes the mean squared error between the original function 
values and the isotonic fit such that the fit is a monotonic 
increasing step function as shown in Fig. [2] 

We use bins of fixed size Az = 0.01 in the range between 
the minimum and the maximum spectroscopic redshift value 
in the training set, since we found that equal frequency bin¬ 
ning degrades photometric redshift accuracy for catalogues 
with long-tailed sample PDF. The weights sum to unity and 
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are calculated using 


Thus, we modify Eq. [23] 


w»(f) = 


j3(6i|f) 


n b 


( 21 ) 


where n bi is the number of training objects with a red- 
shift value in bin bi. The quantity p(bi |f) is an estimate 
for the class probability that a newly queried object with 
photometry f has a spectroscopic redshift inside the bin bi. 
The method used to obtain the class probability estimates 
p(6i|f) is interchangeable (e.g., using Neural Networks 


Bon- 


nett||20i~5l or the Random Forest [Frank fe Bouckaert|2009[ 


Carrasco Kind fc Brunner| 2013 ). We use the Random For¬ 
est algorithm for consistency with the Quantile Regression 
Forest and implemented the OCP algorithm using the ‘ran- 
domForest’ (Liaw & Wiener 20021 package for the R pro¬ 
gramming language (R Core Team|2013|. 


The original paper by Schapire et al. (2002 I used the 


histogram estimator defined in Frank & Bouckaert (2009) 


-r uv\ V" — b z ) 

p{z l f ) = 2_^ Wi ^ —--■ 

i 'b z 


( 22 ) 


i=i 


Here b z is defined as an index denoting the bin in which 
2 is located and r bz denotes the corresponding bin width. 
We can interpret this histogram as a weighted kernel den¬ 
sity estimate with value r^ 1 for all training set objects in 
a bin specified by b z and zero outside. Frank & Bouckaert 
12009) improved the algorithm using a Gaussian kernel func¬ 


tion and demonstrated its superiority over the histogram 
kernel in numerical experiments on machine learning bench¬ 
mark datasets that are unrelated to the photometric redshift 
problem. 


3.3 Bandwidth Selection 

The algorithms we use to obtain PDFs for individual objects 
require the selection of an appropriate bandwidth for the 
weighted kernel density estimator (Eq. [6]. This section pro¬ 
poses a bandwidth selection scheme that selects the band¬ 
width for the Gaussian kernel during model tuning using the 
MNLL. 

The choice of a proper bandwidth depends on the shape 
of the underlying distribution and the number of objects 
available to construct the estimator. Assuming a normal dis¬ 
tribution and a Gaussian kernel function one can obtain the 
optimal bandwidth as 


fTscott — 1.06 


ATi/5 


(23) 


where a is the sample standard deviation and N denotes the 
number of objects. This so-called ‘Scott’s rule’ is commonly 
used in the machine learning and statistics literature (e.g. 


Takeuchi, Nomura fe Kanamori|2009 Wang fe Wang|2007 |. 
To apply this bandwidth selection rule to weighted data, 
we need to calculate the weighted standard deviation from 
the weighted training set using Eq. [20] Scott’s rule gives a 
good first estimate of a suitable bandwidth for distributions 
which are approximately normal. 

Photometric redshift PDFs are in general not Normal 
distributions and Eq. [23] can pick a non-optimal bandwidth. 


o-acott - a N1/5 , (24) 

with a pre-factor a that is chosen to minimize the MNLL 
on the validation set. We can stack the N te individual PDFs 
in the test set using an individual bandwidth a a for each 
object 

JVte JVtr 

p(z) = E ^stack,a E Wi(fa)AT{z,Zi,<T a ) , (25) 

a=1 i= 1 

or we can use a global bandwidth a a = a. 

3.4 The Gaussian Mixture Model Estimator 

Storing the individual PDFs obtained by weighted kernel 
density estimation for every element in the test set requires 


a large amount of storage. Carrasco Kind & Brunner (2014) 


proposed several different methods, including a Gaussian 
mixture model, to more efficiently store a previously ob¬ 
tained estimate. The authors store individual PDFs using 
10-20 numbers compared with 200 used previously. Instead 
of giving a previously estimated individual PDF a sparse 
representation, we fit the Gaussian mixture model directly 
to the weighted spectroscopic redshift values in the train¬ 
ing set and ensure model sparsity by penalizing the model 
likelihood dependent on the number of components in the 
mixture model. 

More specifically, we fit the Gaussian mixture to the 
weighted spectroscopic data with the expectation maximiza¬ 
tion algorithm (for an introduction see Chen & Gupta]20l01 


2006 


as implemented in the Rmixmod package (Biernacki et al. 
Auder et al.|2014l. In (5.1 and (5.2 during the analy¬ 


sis using CFHTLS, we select the number of Gaussian com¬ 
ponents for each object in the test set using the normal¬ 
ized entropy criterion (Celeux & Soromenhoi 1996;!Biernacki,; 


Celeux fe Govaert|1999 1, appreviated as NEC in the follow¬ 

ing. The maximum number of Gaussian components that 
can be included in the mixture model is a parameter that is 
selected during model tuning as described in §2.4| 

For a A-component Gaussian mixture model fitted on 
the weighted training data, the NEC criterion reads 

E(K) 


NEC(K) = 


L(K) - L( 1) 


(26) 


where L(K) denotes the maximum weighted log-likelihood 


(27) 


l ( k ) = E log E akJ ^ ( z spec,i j 5 Oi) 


for the K component Gaussian mixture model. The entropy 
E(K) is defined as 

K N 

E(K) = - E E W ( fi )7fc (Aspec.i) log ( 7 fe (Zspec.i)) ^ 0 , 
k=1 i= 1 

(28) 

where the definition of the component weight proportions, 
following Eq. |12[ is used. We pick the number of compo¬ 
nents K such that the NEC criterion is minimized, where 
NEC( 1) = 1 (Biernacki, Cele ux fe Govaert|1999| . 

The NEC criterion normalizes the entropy by the max¬ 
imum weighted log-likelihood, in which the offset for a one- 
component mixture is substracted. There are two reasons 
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(Celeux & Soromenho 1996) why we cannot use the en¬ 
tropy E(K) directly. The entropy for K = 1 provides a 
lower bound 


E(K) > E{ 1) VK > 1 (29) 

and the maximum weighted log-likelihood function is an in¬ 
creasing function of K, which makes E(K) unequal for dif¬ 
ferent values of K. The entropy term E(K) measures how 
much overlap there is between the different components of 
the Gaussian mixture model. In the case where the compo¬ 
nents in the model fit completely separated data clusters, 
the entropy term approaches zero. If we select too many 
components, the quantity E(K ) will increase because the 
components will overlap strongly. This can be compensated 
by the higher likelihood of the more complex model. In this 
way, we can efficiently determine a suitable number of com¬ 
ponents to include into the mixture. 


3.5 Highest Weight Element 

A common application for individual PDFs is the estima¬ 
tion of the sample PDF. Storing and processing individual 
PDFs is computationally expensive. We propose the Highest 
Weight Element (hereafter HWE), a single point estimate 
for each object from which we can accurately reconstruct 
the sample PDF. We first run the QRF algorithm to deter¬ 
mine weights as for individual PDF estimation. Instead of 
using the individual PDF, we select the spectroscopic red- 
shift value that is associated with the maximum weight. If 
more than one spectroscopic redshift value has the same 
maximum weight, we randomly select one of those values. 


4 DATASET 


We use photometric imaging data from the CFHTLS Wide 
survey using the following bands u *, g', r', %' and z'-band 
as obtained from the pu b lic CFHTLe nS data release (Er- 
ben_& CFHTLenS Collaboration||2012| |*| We obtain the 
photometry analogously to Brimioulle et al. (20131, i.e., we 


degrade all images to match the band with the worst see¬ 
ing, and use the unconvolved i'-band as the detection band 
and the convolved frames as the extraction band. Then we 
correct for the remaining zeropoint calibration uncertainties 
and varying galactic extinction by comparing the measured 
star colours from the catalogues with predictions of the Pick¬ 
les star library (Pickle s|l9981 . In this way, we eliminate pos¬ 
sibly remaining field-to-field variations in the photometric 
calibration. 

We then match our photometric catalogues to public 
spectroscopic redshift samples. These samples are the Vis¬ 
ible Multiobject Spectrograph (VIMOS) VLT Deep Survey 
(VVDS) flLe Fevre et al. |[2?KT4] [cTarilli el al.pOOSj ), VVDS- 
F22, the Deep Extragalactic Evolutionary Probe-2 (DEEP- 
2 ) program ( |Davis et al.|2007||Vogt et al.|2005||Weiner et al.| 
20051 and the VIMOS Public Extragalactic Redshift Survey 
(VIPERS) ( Garilli et al.|[2014 Guzzo et al.||2014 |. We only 
make use of spectroscopic redshifts with confidence values 


4 http://www.cfhtlens.org 


2.5 



MAG_AUTO i' 

Figure 3. Spectroscopic redshift and MAG AUTO i' distributions 
of the compiled dataset described in f |f] Objects matched from 
different spectroscopic surveys are indicated by different colors. 
We limit the spectroscopic redshift range to z spec < 1.5 in the 
plots excluding 34 objects with higher redshift. 


of at least 95% and only use pointings where the i’-data is 
available and where the i’-band serves as detection band. 

This produces a total sample of 28159 objects with i' ^ 
22.5 and additional 5893 objects with 22.5 < i! ^ 24.5 with 
spectroscopic redshifts and five band photometry. 


5 RESULTS 


Future large area photometric surveys will produce large 
amounts of photometric data for which we need to obtain 
redshift information. Efficiency in terms of runtime and disk 
space will be important in order to use algorithms for photo¬ 
metric redshift estimation effectively on these large datasets. 
Additionally we are required to produce high quality photo¬ 
metric redshift PDFs in order to obtain accurate constraints 
on, for example, cosmological parameters or cluster masses. 

We use the public CFHTLS data described in (|4| to 
compare the accuracy of photometric redshift PDFs esti¬ 
mated by the algorithms described in We show that 
these methods improve the modelling of angular correla¬ 
tion functions, cluster mass estimates, and the modelling 
of shear correlation functions compared to results obtained 


with the Neural Network code ANNz (Collister & Lahav 
|2004[) commonly used in the literature (e.g., |Sheldon et al. 
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r/ cr(Az) (A z) (7 gg 


ANNz 1.23% 0.092 -0.001 0.044 

PhotoZ 2.27% 0.129 -0.008 0.050 


Table 2. Point prediction performance of the Neural Network 
code ANNz and the template fitting code PhotoZ quantified by 
the metrics described in §5.1| 



Figure 4. Density contours of photometric redshift estimates 
from ANNz against the spectroscopic redshift. 



Figure 5. Sample PDF estimated using ANNz and the High¬ 
est Weight Element. The histogram shows the true spectroscopic 
redshift distribution. 



Total 

[0, 0.585[ 

[0.585,0.7488[ 

[0.7488, 3.818[ 

OCP 

- 1.3577 

-1.3905 

-1.6432 

-1.0395 

NOCP 

-1.2847 

-1.3029 

-1.5880 

-0.9648 

QRF 

-1.3483 

-1.3627 

-1.6470 

-1.0347 

GMM 

-1.3181 

-1.3591 

-1.5606 

-1.0354 

ANNz 

-1.1588 

-1.3138 

-1.4891 

-0.6731 


| 2009[|Williamson et al.|2011||Smith et al.|201 2| [Planck Col-| 

[laboration et al.|2015|. 


5.1 Comparison with ANNz 


We train an ensemble of 20 Neural Networks with two hidden 
layers, each consisting of 12 nodes, following the methodol¬ 
ogy described in j]2.4[ 

The photometric redshift estimates obtained from 
ANNz are competitive compared to those obtained with the 
template fitting code PhotoZ ( Bender et al.|200T Brimioullc 
|et al. |[2008[ |Greisel et al.||2013| ) in terms of common pho¬ 
tometric redshift performance metrics. As shown in Table 
[2] ANNz improves upon the photometric redshift perfor¬ 
mance obtained with PhotoZ by 46%, 29%, 88% and 12% in 
terms of outlier rate, scatter, bias and spread of the resid¬ 
uals. The outlier rate rj is defined as the fraction of objects 
with | Zspec — 2 P hot| > 0.15. The bias (A z) and scatter <r(Az) 
are the mean and standard deviation of the distribution of 
the residuals Az = z p h ot — z sp ec■ The spread of the residual 
distribution is measured by the aes metric which is defined 
as half the difference between the 16% and 84% quantile. 

The quality of the photometric redshifts obtained with 
ANNz is illustrated in Fig. [4] It shows a tightly aligned 
correlation between photometric and spectroscopic redshift. 
We estimate sample PDFs from the ANNz point predic¬ 
tions and the stacked Normal densities constructed from 
the ANNz error estimates, in the following referred to as 
‘ANNz-stack’. While showing excellent point prediction per¬ 
formance, ANNz and ANNz-stack do not accurately esti¬ 
mate the sample PDF as shown in Fig. [5] The sample 
PDF constructed from ANNz-stack deviates from the true 
spectroscopic redshift distribution in the central redshift 


Table 3. MNLL of the Quantile Regression Forest (QRF), the 
classification based PDF estimation algorithms (OCP/NOCP) 
and the OCP algorithm used with the Gaussian mixture model 
(GMM). The values are evaluated over the full spectroscopic red¬ 
shift range and in three bins. The result is illustrated in Fig. [6] 


range [0.45,0.85]. We will show in the following sections 
that these deviations from the true spectroscopic redshift 
PDF introduce a systematic bias in several important anal¬ 
yses in cosmology. To compare the quality of photomet¬ 
ric redshift PDFs of individual objects (individual PDFs), 
we evaluate the MNLL (Eq. 15 I of the four discussed al¬ 
gorithms (QRF, NO CP, OCP and OCP GMM) on the full 
range of redshift values and in three redshift bins ([0,0.585[, 
[0.585, 0.7488[ and [0.7488, 3.818[). The results are shown in 
Table [3] and illustrated in Fig. [6] QRF, NOCP and OCP 
employ the weighted kernel density estimate. OCP GMM 
denotes the Gaussian mixture model applied in combina¬ 
tion with weights determined using the ordinal classification 
method described in §3.2| 

We illustrate the relative improvement MNLL re i gained 
by applying these algorithms compared with ANNz-stack 


MNLL re i = f MNL tZV MN ^ 1S ' ) (30) 
V I MNLL ANNz I J 

in Fig. [6] A high value in terms of MNLL re i translates into 
an improvement in the log-likelihood of the individual PDFs 
over those obtained with ANNz-stack. The boundaries of the 
redshift intervals are picked such that they contain approx¬ 
imately the same number of test set objects. All discussed 
methods improve over ANNz-stack. For the highest redshift 
objects, our methods show improvement of up to 50%. The 
OCP routine performs the best and improves the NOCP 
routine. This verifies the superiority of the ordinal classifi- 
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Spectroscopic Redshift 

Figure 6. Relative improvement in MNLL over the performance 
of ANNz-stack. We compare the classification-based PDF esti¬ 
mators (OCP, NOCP), the ordinal classification PDF estimator 
combined with a Gaussian mixture model (OCP GMM) and the 
Quantile Regression Forest (QRF) in three spectroscopic redshift 
bins. The plotted points show the average improvement over the 
full spectroscopic redshift range. 


cation technique. The Quantile Regression Forest performs 
on par with OCP. OCP GMM shows mediocre results, but 
provides the most efficient parametrization using a single 
Normal density per object. 


5.2 Stacked photometric redshift distribution 

Applications like shear tomography require the photometric 
selection of objects in a certain redshift range. We stack the 
individual PDFs compared in §5.1| using weights that quan¬ 
tify their overlap with a certain redshift interval using Eq. 
|10[ These estimates are compared with the weighted ker¬ 
nel density estimate obtained from the spectroscopic red¬ 
shift values using the same weights. The weights are deter¬ 
mined using each of the OCP, NOCP, OCP GMM and QRF 
methods individually. The Highest Weight Element (HWE) 
uses the weighted kernel density estimate with weights de¬ 
termined using the QRF algorithm. We use Scott’s rule to 
choose the bandwidth for the weighted kernel density es¬ 
timates of the HWE and the spectroscopic redshift values. 
The sample PDFs obtained with the OCP, NOCP and QRF 
algorithms are very similar. We therefore restrict the follow¬ 
ing discussion to the OCP method. 

The results shown in Fig. [TJcompare the weighted sam¬ 
ple PDFs obtained with the HWE, OCP and OCP GMM 
methods in the redshift intervals [0,0.585[, [0.585,0.7488[ 
and [0.7488,3.818[. They differ mainly in the amount of 
smoothing present in the estimate. Notably the OCP GMM 
method oversmooths features in the density estimate. This 
is because a single Gaussian was selected during model tun¬ 
ing based on the performance of individual object PDFs. 
Allowing more components reduces the amount of smooth¬ 
ing. The HWE is competitive with methods that estimate 
the individual PDFs, with the advantage that the HWE is 
extremely efficient to calculate and, being a point estimate, 
requires storing only a single floating point number per ob¬ 
ject. 
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The weighted distributions of all methods have tails 
that extend outside the desired redshift range. We can re¬ 
duce these tails by stacking only the objects with the high¬ 
est weight in the respective redshift bin as demonstrated 
in the lower right panel of Fig. [7] We estimate the sample 
PDF from the HWE predictions of the objects with the 5000 
highest weights in the respective redshift bin. The estimated 
weighted sample PDF of these objects has less overlap with 
neighbouring redshift bins, compared with the estimate that 
incorporates all objects. Furthermore it agrees well with the 
equally weighted spectroscopic redshift distribution of the 
corresponding objects. 

Instead of weighting the objects in the respective red¬ 
shift range, we can select objects based on a photometric 


redshift point estimate in analogy to Benjamin et al. (20131. 
We perform the same cut in MAG-AUTO %' < 23.0 and estimate 
the sample PDF in the same photometric redshift intervals 
selected after our ANNz estimate. The results for the HWE 
are shown in Fig. [8] and agree well with the spectroscopic 
redshift distribution. The agreement is better in the central 
bins, which contain more objects, because the histogram ap¬ 
proximates the underlying distribution better. 


5.3 Applications to Cosmology 

We now investigate how the previously discussed meth¬ 
ods can be used to improve analyses that use photometric 
redshifts. We estimate the sample PDF using the Highest 
Weight Element and ANNz. We use kernel density estimates 
with bandwidths selected using Scott’s rule. 

Where required, we impose a flat A-CDM cosmology 
with Q nl = 0.3, Qa = 0.7, n s = 0.96, H = 0.7, <78 = 0.79. 


5.3.1 The Angular Power Spectrum 


The angular power spectrum measures the clustering of 
galaxies and is an important tool to constrain cosmologi¬ 
cal models. 

In the following we adopt the notation of Thomas, Ab- 


dalla & Lahav (2010). Consider the line-of-sight projection 


of the 3D mass distribution in the universe, 620 - The har¬ 
monic modes of 620 are given by 

j3 7 


5e = i 


■'/ 


Pk 

(2-7t) 3 


5{\T)W t (k) 


(31) 


where the window function We{k) is sensitive to the sample 
PDF of light sources, p(z), and can be computed by the 
integral 


W t {k) = j p(z)D(z) 



je(kz)dz . 


(32) 


Here D(z) is the linear growth factor, je(kz) are the Bessel 
functions and relates the redshift to the radial comov¬ 
ing coordinate x. 

The angular power spectrum Ce, is the variance of the 
modes <5^] 

Ci = (5 t 5p = 4t vj A\k)W?{k) d ^ , (33) 


5 In our analysis, we are assuming a galaxy-dark matter bias 
equal to one. 
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1.4 0.0 0.2 

Redshift 


Figure 7. Sample PDFs weighted in three redshift intervals [0,0.585[, [0.585, 0.7488[ and [0.7488, 3.818[. The PDFs are obtained using 
the Highest Weight Element (upper left), the ordinal classification PDF estimator (upper right) and the ordinal classification PDF 
estimator combined with a Gaussian mixture model (lower left). The histograms show the weighted spectroscopic redshift distribution 
using weights determined using the respective algorithms. The lower right panel shows the weighted distribution of the HWE predictions 
for the objects with the 5000 highest weights in the three intervals (blue) and the corresponding weighted histogram of spectroscopic 
redshifts (red). 


where the dimensionless 3D power spectrum A 2 {k) is given 
in terms of the usual 3D matter power spectrum Ps{k) as 


A 2 (fc) 


4tt k 3 Ps(k) 
(27r) 3 


(34) 


From Eq. [32] is can be seen, that the modelling of Ci de¬ 
pends highly on the assumed sample PDF of the data. We 
use the distributions shown in Fig. [5] to model the angular 
correlation power spectrum with the CLASS software pack¬ 
age ( |Blas, Lesgourgues fe Tram| [2011). We define the bias 
introduced by the of the angular correlation function 

estimated using photometric redshifts, as the relative dif¬ 
ference to the results based on the PDF of spectroscopic 
redshifts C| pec : 


Biases 


^-yphot ^-yspec \ 

cr c * ) 


(35) 


The resulting biases are shown in Fig. [9] We find that the 
results obtained with the Highest Weight Element have a 
lower systematic bias in Ci by a factor of four compared 
to the ANNz results and that the improvement is almost 
independent of t. 


5.3.2 Gravitational Leasing 

We investigate two important applications in gravitational 
lensing: quantifying cluster masses by the light deflection 
from background sources, and obtaining cosmic shear corre¬ 
lation functions. In contrast to the previously considered 
analysis of the angular correlation function, applications 
in gravitational lensing require careful selection of sources 
with successfully measured shapes. Since the spectropho- 
tometric dataset used previously is not representative for 
datasets generally used in gravitational lensing analyses, we 
first weight our catalogue such that it mimics a CFHTLS 
shape catalogue. To do this, we obtain a photometric shape 
catalogue from public CFHTLS data, which is then used as 
the reference to weight the spectrophotometric dataset. 


5.3.3 Catalogue Creation and Weighting 

Whether the shape of an object can be measured, depends 
primarily on its intrinsic size and magnitude in the respec¬ 
tive band. We therefore reweight our spectrophotometric 
catalogue such that it resembles a CFHTLS shape catalogue 
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Figure 8 . Sample PDFs estimated using the HWE for subsamples selected in analogy to |Benjamin et al.| (|2013| Fig. 1) using a cut at 
MAG.AUT0 i' < 23.0. The subsamples are selected using the photometric redshift estimates from ANNz in intervals shown in the subfigure 
titles. 



Figure 9. Bias in the angular correlation power spectrum ob¬ 
tained for different estimates for the sample PDF. We restrict the 
comparison to i < 1200 . 


as follows 

Sintr = y/FWHM image 2 - (FWHMpaf) 2 , (36) 

where (FWHM pa f) is the average size of the point spread 
function for the respective chijrj 

In this way, we obtain Si n tr and MAG_AUT0 i entries for 
each object in the shape and spectrophotometric catalogue. 
We now determine weights for the spectrophotometric cat¬ 
alogue such that, after weighting, it matches the size and 
magnitude distribution of the shape catalogue. Furthermore, 
the results obtained with the reweighted spectrophotomet¬ 
ric catalogue have to be robust against the removal of the 
objects with the highest weights ( |Sanchez et al.|2014| . Since 
we do not have enough spectroscopically observed objects to 
mimic the shape catalogue at the faint end, we have to em¬ 
ploy a magnitude cut in order to fulfill both requirements. 
For the analyses presented in §5.3.4| and §5.3.5| we employ a 
magnitude cut at MAG_AUT0 i < 23.5 and MAG_AUT0 i < 23.0, 
respectively. We give a detailed discussion of these cuts in 
the appendix. 


in terms of these properties. We obtain the shape catalogue 
in analogy to Brimioulle et al. (20131 for the full CFHTLS 
survey region. Intrinsic sizes Sintr are calculated for each ob¬ 
ject from the measured FWHMi mage and corrected for seeing 


6 We work on image stacks, but (as in Brimioulle et al. |2013j ) only 
consider objects, for which all images contribute to the stack from 
the same CCD-chip. 
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MAG_AUTO i' 

Figure 10. Distributions in MAG_AUT0 i' band for the original 
spectrophotometric dataset, the re-weighted spectrophotometric 
dataset and the shape catalogue for MAG_AUT0 i' < 23.5. 



Size (Pixel) 


Figure 11. Distributions in intrinsic size (Eq. |36[ ) for the original 
spectrophotometric dataset, the re-weighted spectrophotometric 
dataset and the shape catalogue for MAG AUTD i' < 23.5. 


We combine bootstrap re-sampling with the k nearest 
neighbour estimator to determine weights for the elements 
in the spectrophotometric catalogue, such that the weighted 
catalogue mimics the distribution of the shape catalogue in 
the two dimensional space spanned by the intrinsic size of 
the objects and their magnitude MAG_AUT0 i'. To this end, we 
draw bootstrap samples from the shear catalogue and find 
the k nearest neighbours in the spectrophotometric cata¬ 
logue. The nearest neighbour of an object in the spectropho¬ 
tometric catalogue is the object in the shear catalogue with 
the lowest Euclidean distance to this object. Accordingly, 
the k nearest neighbour algorithm selects the k nearest ob¬ 
jects with respect to the Euclidean distance. The number 
of times an object in the spectrophotometric catalogue is 
selected as one of the k nearest neighbours corresponds to 
its weight. This process is similar to previous work done 
by Lima et al. (2008), which employs a nearest neighbour 
based approach to determine weights for objects in a spec¬ 
troscopic sample to estimate the sample PDF of the pho¬ 
tometric data. In contrast to our method, which is based 
on bootstrap re-sampling, they calculate the density ratio 
between the distributions characterizing the two catalogues 
using a nearest neighbour approach. For the data at hand, 
we draw 10 6 bootstrap samples and consider three nearest 
neighbours k = 3. This method accurately weights the spec¬ 
trophotometric data to mimic the size and i-band magni¬ 
tude distributions of the shape catalogue, as shown in Figs. 
[lO]and[TT] The following analysis uses the estimated weights 
to weight the sample PDF of ANNz, ANNz-stack, the High¬ 
est Weight Element and the spectroscopic data as shown in 
Fig.pl 


5.3.4 Cluster Mass Measurement 


Galaxy Clusters are one of the primary tools to probe the 


A-CDM model (for a review, see e.g. Allen, Evrard & Mantz 
[2011). Cluster masses can be determined by measuring the 



Figure 12. Weighted stacked sample PDF estimated using 
ANNz, ANNz-stack and the Highest Weight Element. The his¬ 
togram shows the weighted spectroscopic redshift distribution. 
We use a cut on MAG_AUT0 i’ < 23.5. The used weights and cuts 
are described in §5.3.3| 


tangential alignment of gravitationally lensed galaxie^] lo¬ 
cated behind the clusters. The accuracy of these weak lens- 
ing mass estimates suffers from uncertainties in the photo¬ 
metric redshift of the lensed sources. In combination with 
other effects such as cluster mass profile variances, they can 


introduce systematics at the 5% to 10% level (see e.g. Ap- 
plegate et al.pOTl l. In the following, we will only consider 
uncertainties due to errors in photometric redshift estimates 


(Seitz & Schneider 1997 

Mandelbaum et al. 2008 Dawson 

et al.|2012 Gruen et al. 

2013 2014 Applegate et al. 2014). 


The excess surface density inside radius R 

(S(r)) 7 ,< i j — E(7?) = Ecrit 7tan(-R) 


(37) 


1 For a introduction into gravitational lensing we refer to Bartcl- 
|mann &: Schneider| (|200 j~[>. 
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Figure 13. Relative bias in the mean critical surface density (Eq. 
|39| ) for different lens redshifts obtained using different estimates 
for the sample PDF. The filled area shows the lcr error interval. 



6 [arcmin] 

Figure 14. Relative bias in the shear correlation function esti¬ 
mate for £_ (Eq. |43| obtained using different estimates for the 
sample PDF. 


is proportional to the critical surface density 

<£ cr ) oc J™ dzp(z) ^ D A^™)Dds(z^ s ,z) ^ (3g) 

of the lens at redshift ZLens- Here Dd, D s and Dd s denote 
the angular diameter distance to the lens, the source and 
between the lens and the source respectively. Uncertainties 
in the sample PDF of background sources p(z) will propa¬ 
gate into systematic errors in the determination of the criti¬ 
cal surface density. This introduces systematic errors in the 
excess surface density and therefore in the cluster mass es¬ 
timate. 

We quantify the systematic bias of the critical surface 
density as 

-rj- ( ^^ cr )photo (^ cr )spec | /on\ 

Bias < E “> = - <^>- * (39) 

\ \ Lr /spec / 



6 [arcmin] 

Figure 15. Relative bias in the shear correlation function esti¬ 
mate for (Eq. |43[ ) obtained using different estimates for the 
sample PDF. 


where (E cr ) photo is estimated from the photometry of the 
objects (e.g., using machine learning) and (E cr ) spec from the 
spectroscopic redshifts. We estimate the error a on this bias 
with respect to our test set containing N objects as 


2 


a 


I ^photo(E C r) 

\pN (Scr) 


(40) 


The mean and standard deviation of the distribution 
of Tier are estimated using the probability density function 
estimates obtained from ANNz and the Highest Weight El¬ 
ement and we present the results in Fig. [13] 

The Highest Weight Element estimate for the sample 
PDF reduces the systematic bias in the critical surface den¬ 
sity compared with ANNz by a factor of four for medium 
lens redshifts 2 G [0.45,0.6]. The systematic bias in (£ cr ) 
obtained from the HWE is consistent with zero for lens red¬ 
shifts z < 0.7 and, in general, outperforms the results ob¬ 
tained with ANNz. Higher lens redshifts are however unre¬ 
alistic for current survey depths. 


5. 3.5 Cosmic Shear 


Cosmic shear is the weak lensing effect generated by the 
inhomogenous matter distribution of the universe and has 
became an important tool to constrain cosmological param¬ 
eters (see, e.g., Kilbinger et al.| 2013] and references therein). 
Similar to our discussion of the angular correlation function, 
we derive a power spectrum P K (£) of the lensing convergence 
k, which is the source of the lensing potential, defined with 
respect to the radial co-moving coordinate x 

r * (^) p * (T 


Pn(t) = 


(41) 


We calculate the power spectrum Ps (|,a;) using the halofit 
formula from |Smith et al.~] ( [200 3]). The lensing efficiency q(x) 
quantifies how strongly the objects in an infinitesimal shell 
of radial comoving coordinates deflect the light coming from 
background sources. Since the radial comoving coordinates 
of the objects are related to their redshifts, the lensing ef¬ 
ficiency q(x) depends on the sample PDF p(z). From the 
lensing convergence power spectrum, we can obtain the two 
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shear correlation functions ( Kaiser| 19921 as 

z±( e ) = ^f o diiP K {i)j 0A m-. 


(42) 


where the Bessel function J 0 (J4) corresponds to the (£_) 
correlation function. In analogy to the previous sections we 
quantify the bias in the shear correlation functions obtained 
from photometric data £P hot by their relative error with re¬ 
spect to the results obtained from the spectroscopic data 

>-spec 
S± > 


( £-photo £-spec \ 

± ^spec ± J ■ (43) 

The results are presented in Figs. 1 14| and |15| We reduce the 
bias in the shear correlation function estimates, using the 
Highest Weight Element estimate instead of the photometric 
redshift estimates from ANNz, by a factor of 12 for £_ and 
a factor of 6 for £ + . 


6 SUMMARY AND CONCLUSIONS 

The next generation photometric surveys will measure the 
positions on the sky of thousands of millions of galaxies. 
We must be able to reliably estimate the distance to, or the 
redshift of, each photometrically identified galaxy before we 
can use these galaxies in analyses to derive the values of 
cosmological parameters. Furthermore to maximize the pre¬ 
cision and accuracy of any derived parameters, we require a 
complete understanding of the full shape of the photomet¬ 
ric redshift probability density function (hereafter PDF) for 
both each individual object, and the entire galaxy sample. 

I 11 this work we develop and discuss methods drawn 
from machine learning, to accurately estimate photometric 
redshift PDFs, which will meet both the future storage de¬ 
mands of large surveys, and the precision demands for cos¬ 
mological parameter estimation. 

As a working example, we apply these algorithms to 
a sample of galaxies selected from the CFHTLS survey for 
a set of cosmological analyses. We demonstrate that these 
methods reduce the biases in all of the analyses examined. 
We also show that these biases result from the mishandling 
of the full shape of the photometric redshift PDFs. 

This advancement is quantified by comparing several 
accurate methods to estimate photometric redshift PDFs for 
individual objects (hereafter individual PDFs). We estimate 
individual PDFs using a classification scheme that classifies 
objects into redshift bins and thereby constructs the PDF 
using the probabilities for bin membership. In contrast to 
the classification-based PDF estimation methodology com¬ 
monly used in the astrophysics literature, we incorporate 
the order of consecutive redshift bins into the classification 
framework. This produces more accurate individual PDFs. 
We quantify the performance of the methods by measuring 
the average log-likelihood of all PDF estimates in a test sam¬ 
ple. Our method outperforms other non-ordinal classifica¬ 
tion and regression schemes, for example classification trees 
and Neural Networks. Specifically, for high redshift objects, 
our method reaches performance gains of over 50% in aver¬ 
age log-likelihood when compared with the results obtained 
using the common Neural Network code ANNz. We con¬ 
struct the individual PDFs using kernel density estimation 


which inherently requires the selection of a suitable band¬ 
width to govern the smoothing scale. We propose an efficient 
method to choose the smoothing scale on an object by ob¬ 
ject basis. We further discuss a Gaussian mixture model, 
whose complexity is adaptively selected for each individ¬ 
ual object, using a criterion that penalizes model complex¬ 
ity. This method shows solid performance compared with 
kernel density estimates, while providing a more efficient 
parametrization of individual PDFs. 

Many cosmological analyses require an accurate knowl¬ 
edge of the full shape of the galaxy sample PDF, instead 
of estimates for the individual PDFs of each galaxy. Sam¬ 
ple PDFs are typically obtained by stacking the PDFs of 
individual galaxies, and so their estimation and storage is 
required. This reconstruction of the individual PDF typi¬ 
cally requires the storage of several hundred floating point 
numbers. Complex post processing algorithms can reduce 
this number to 10 - 20 floating point numbers per object 
at the expense of additional computation time. However in 
this work, we propose a new single point estimator for each 
galaxy, called Highest Weight Element (HWE), which can 
be used to accurately reconstruct the full sample PDF. This 
leads to a significant reduction in the storage requirements 
of future photometric surveys. Furthermore, we note that re¬ 
constructing the full sample PDF using the point estimator 
method described in this paper requires orders of magnitude 
less computation time than using other common redshift 
codes. 

Applications such as shear tomography require the ac¬ 
curate photometric selection of objects in redshift bins. We 
weight photometrically observed galaxies such that their 
sample PDF lies within the predefined redshift range. The 
weights are estimated from the overlap between the indi¬ 
vidual redshift PDFs and the redshift selection interval. We 
further use these weights to improve the selection of a sam¬ 
ple of galaxies, such that their sample redshift PDF is more 
accurately confined to be within the predefined redshift bin. 

We now return our attention to the specific use case 
highlighted above using CFHTLS galaxies. In particular we 
examine the following cosmological analyses: the estimation 
of cluster masses using weak gravitational lensing, the mod¬ 
elling of galaxy angular correlation functions, and the mod¬ 
elling of cosmic shear correlation functions. In each case we 
compare the results, and estimate biases, using results ob¬ 
tained with ANNz. 

For lensing clusters within the redshift interval 0.45 < 
z < 0.6, we show that our methods reduce the relative bias 
in the cluster mass reconstruction by up to a factor of 4. 
Furthermore our methods improve the relative biases in the 
modelling of the explored large scale structure, and cosmic 
shear correlation functions by similar values. 

In this paper we have shown that the usual point es¬ 
timate of a photometric redshift is a poor estimator when 
used to reconstruct the full sample redshift PDF. We note 
that these point estimates are still used in many recent anal¬ 
yses, and we have shown that their continued use can lead 
to large biases in cosmological analysis. By using the new 
HWE point estimator method, highlighted in this paper, we 
show that the full shape of the sample PDF can be estimated 
more accurately and that this reduces the biases incurred by 
mis-estimating the sample PDF. 

The results discussed in this paper have been obtained 
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under the idealized assumption that the data used to train 
the models is completely representative of the test data. In 
applications where this is not the case, data augmentation 
techniques (Hoyle et al. 2015aI can be used to artificially 
populate regions of color-magnitude space, that are not fully 
covered by spectroscopy. These techniques assume a model 
for the data distribution and can be seen as a form of extrap¬ 
olation. Weighting methods (f 5.3.31 are in some cases an al¬ 
ternative to data augmentation. If all relevant attributes are 
included, these algorithms can be used to determine weights, 
such that the weighted dataset resembles a reference dataset. 

To aid the common adoption of these tools and tech¬ 
niques we will make the source code of all algorithms pub¬ 
licly available on the homepage of the first author. 
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APPENDIX: TESTS OF WEIGHTING SCHEME 


The analyses in §5.3.5[ §5.3.4| have been carried out by 
weighting the photospectroscopic dataset such that it re¬ 
sembles a shape catalogue. If only a few objects in the 
reweighted catalogue are given high weights, the analyses 
can strongly depend on these objects. We lack spectrocopi- 
cally observed objects at the faint end of the shape catalogue 
and therefore employ a magnitude cut to avoid giving large 
weight to the faint, unrepresentative part of the spectropho- 


tometric catalogue. In analogy to Sanchez et al. (20141, we 


test the robustness of our weighting scheme with respect to 
the considered applications by excluding the top 5% of the 
objects that are given the highest weights. 

The bias in the critical surface density is robust against 
the exclusion of the highest weighted objects for a magnitude 
cut at MAG_AUT0 i! < 23.5 as shown in Fig. 16 The results 
improve for all algorithms if these objects are removed. The 
conclusions of the analysis, i.e. that the Highest Weight El¬ 
ement (HWE) leads to a lower bias compared with ANNz, 
remain valid. 

The analysis of the biases incurred in estimates of the 
cosmic shear correlation functions requires a more conser¬ 
vative cut at MAG_AUT0 %' < 23.0, to be robust against the 
removal of a small number of highly weighted objects, as 
can be seen in Fig. [Tt] and Fig. |18| For a magnitude cut 
at MAG_AUT0 i' < 23.5, ANN^j gives a better overall result 
compared with the HWE, while the opposite is true if the 
5% objects with the highest weight are left out. 


8 The results for ANNz-stack are very similar. Therefore we do 
not show them here. 



Figure 16. Relative bias in the mean critical surface density (Eq. 
|39| ) for different lens redshifts obtained using different estimates 
for the sample PDF. We show the relative biases obtained for 
the weighted dataset cut at MAG_AUT0 i 1 < 23.5 in solid lines, and 
the corresponding results with the 5 % highest weighted objects 
removed in dashed lines. 


Note that this is not because the p(z) reconstruction of 
ANNz is superior at faint magnitudes. Instead this can be 
explained by considering the bias in the integrand in Eq. [42] 
with respect to the spectroscopic result given as 

Bias = ^ JoA ®) (^ phot W - Pr c (p) ■ (44) 

As shown in Fig. |19| ANNz both partly underestimates and 
overestimates the true spectroscopic integrand at different 
redshift values such that these two effects compensate each 
other. Since the lensing efficiency is dominated by the high 
redshift tail of the stacked PDF, the peculiar shape of the 
ANNz reconstruction in this range happens to outperform 
the otherwise superior HWE method. The shape of the high 
redshift tail strongly depends on a small number of faint 
objects, which are given a high weight. Accordingly, this ar¬ 
tifact is no longer present if the top 5% of the objects with 
the highest weights are left out. For a more conservative cut 
at MAG_AUT0 i' < 23.0, the analysis is no longer dominated 
by a few highly weighted objects at the faint end of our 
spectrophotometric catalogue, the ANNz analysis does not 
outperform the HWE, and the interpretation does not de¬ 
pend on the removal of the objects with the highest weights. 
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Figure 17. Relative bias in the shear correlation function es¬ 
timate for £_ (Eq. |T> [ ' obtained using different estimates for 
the sample PDF. We show the relative biases obtained for the 
weighted dataset cut at MAG_AUT0 i' < 23.5 in solid lines and 
MAG AUTO i' < 23.0 in dashed lines and the corresponding results 
with the 5% highest weighted objects removed. 



Figure 18. Relative bias in the shear correlation function es¬ 
timate for (Eq. |43| obtained using different estimates for 
the sample PDF. We show the relative biases obtained for the 
weighted dataset cut at MAG_AUT0 i' < 23.5 in solid lines and 
MAG AUTO i' < 23.0 in dashed lines and the corresponding results 
with the 5% highest weighted objects removed. 
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